1. Introduction

1.1 Background

In 2018, Capital Metropoitan Transportation Authority (CapMetro), a public transportation agency serving Austin, Travis and parts of Williamson Counties, launched the “Cap Remap” by redesigning the bus network of the city as part of its transit development plan, Connections 2025. Cap Remap adjusted the transit network according to internal analysis and community outreach and aims to provide more fewquent, more reliable, and better connected bus system. Specifically, it remapped certain routes, tripled the number of bus routes that operate every 15 minutes, and made sure the frequency meets the need on weekends.

1.2 Use Case

Given the renewed interest in bus transit in US cities, we think there is an opportunity to innovate in the realm of bus planning using modern data science methods. The goal of this article is to present a tool for planners to test how changes of built environment and/or bus route information impact bus ridership in Austin.

We will start with an evaluation of the Cap Remap to further understand the trend, pattern, and charateristics of the ridership in Austin. Then, we will use the Automated Passenger Counter data and other open data to develope a machine learning model, followed with accuracy result and validation across different regions.

2. Exploratory Analysis

2.1 What Data Are We Using?

The ridership data we used come from the Automated Passenger Counter(APC), which counts the number of boarding and alighting on any given bus. The image below illustrate the APC system at work. source

There are two sets of ridership data that we have used - aggregated and disaggregated. Aggregated data presents average ridership data of every single bus stop in the system on weekdays in all the months from 2017 to 2019. We used the aggregated data to explore the annual ridership trend of the bus system and the neighborhood and school district effect on ridership. And these fixed effects are used later in the model.

Disaggregated data contains more detailed ridership data. It is detailed to each trip within a time period of a week. Therefore, besides the ridership information at a specific bus stop during a specific trip, disaggregated data also provides the trip’s route numbers and route directions so that we can associate those route characteristics with the ridership. The data used here are disaggregated data in May 21-25th and June 11-15th in 2018, which are one week before and after the CapRemap. As mentioned above, the disaggregated data consist of passenger and bus information for each route, each shift, and each stop. Basically, the number of passenger boarding, alighting, and other related ridership information are recorded at each stop for each shift of each bus route. We use the disaggregated data to study the relationship between route characteristics, such as route type, route orientations, and riderships, and further identify popular routes and unpopular routes and create additional route-related features for modelling.

2.2 Annual Citywide Ridership Trend

How did ridership change before and after CapRemap (06/03, 2018)?

Current available data from Capital Metro allows us to observe the trend in ridership change before and after Cap Remap. The first important part of exploratory analysis is to see the city-wide change in ridership brought by CapRemap. With the stop-level data from Janurary 207 to November 2019, the aggregated city-wide ridership change is shown in the chart below.

The x-axis represents month, and y-axis repredents the average daily ridership in the given month. The yellow, blue, red lines represents the monthly ridership in year 2017, 2018, 2019, respectively. The vertical line represents month June, which is the month when Cap Remap rolled out in 2018.

From the trend in 2018, it is clear that ridership fluctuated between months. Year 2017 and 2018 disply similar trend where we see a spike of ridership in June. However, year 2019 seems to have a smoother curve and ridership actually decreased in the same month. We suspect that there might be issues related with data retrieval and will ask for new data. Ridership are generally high in fall and winter in all three years. Our hypothesis is that this could be related with school or university holiday and class schedules. If we ignore the June data and compare the trend across three years, we do see an increasing ridership trend in general, each year has higher ridership than before, which demonstrates the effect of Cap Remap.

2.3 Ridership Pattern in Subdivisions

After knowing the trend of city-wide ridership change, the next question is how the ridership changed across the city: which area experienced ridership increase and which area exprienced ridership decrease. Neighborhoods in Austin are used here to show the spatial trend here.

As shown in the map, darker blue represents higher ridership increase, darker red presents lower ridership increase or even ridership decrease. As shown in the map, mostly downtown areas experienced ridership increase from June to September while the outskirts of Austin experienced low ridership increase or even ridership decrease.

We then created charts showing the ridership change in each neighborhood in 2018 in June and September. There are 12 neighborhoods experienced ridership decrease from June to September. There are several neighborhoods experienced high ridership increase of more than 10,000 from June to September. Generally, most neighborhoods experienced ridership increase after CapRemap from June to September. Among the 78 neighborhoods in Austin, we identified three neighborhoods that represents different characteristics: neighborhoods with expected ridership increase; neighborhoods with unexpected ridership increase; neighborhoods with unexpected ridership decrease.

UT is the neighborhood with expected ridership increase.The location of UT neighborhood is just above downtown neighborhood. With a lot of university students living around here, the bus network is sensitive to school schedule. There is a relatively clear trend in ridership change according to school seasons.

The second neighborhood Govalle is the neighborhood that experiencnig unexpected ridership increase. After CapRemap, the ridership in Govalle nearly increased by 50% to 75%. As Govalle is closer to the outskirts of Austin, this ridership increase might reflects CapRemap’success in strengthening the east-west connection.

But there are also neighborhoods exepriencing ridership decrease on the east-west direction. Zilker located in the southwest side of Austin’s downtown region. Its ridership experienced a gradually slight decrease after CapRemap.

2.4 Route Analysis

2.4.1 Route Type

What could potentially influence ridership in terms of route information?

There are several route types, each serving different purposes. Our hypothesis is that they will play an important role in determing the ridership.

The Austin Bus System is comprised of nine types of routes. The graphs below show six main route types. Since Capital Metro is a regional transit agency so that its service area covers more than City of Austin. Since we mainly focus our analysis and model building in Austin, the basemap below outlines only City of Ausitn.

Regarding route types, the characteristics are listed below:

Local: Capital Metro’s Local routes are intended to connect specific neighborhoods of Austin to Downtown Austin, with frequent stops.

MetroRapid: Capital Metro’s MetroRapid routes is an ostensibly bus rapid transit service serving high-traffic corridors. The service utilizes high-frequency service of every 15 minutes on weekdays with 10 minute service at rush hours.

UT Shuttle: The UT Shuttle system includes a number of routes during the University of Texas semester. They do not operate on Saturdays, except during finals.

Crosstown: Capital Metro’s Crosstown routes are local services between two neighborhoods of Austin, for which the route does not pass through Downtown Austin or the University of Texas.

Limited & Flyer: Capital Metro’s Limited and Flyer routes are limited stop services between two destinations. Limited routes tend to have fewer stops compared to their local counterparts, while Flyer routes serve nonstop between downtown or the UT campus and their neighborhoods of service.

Feeder: Capital Metro’s Feeder routes are local services between a neighborhood and a major transfer point for connecting service.

2.4.2 Hotlines

What makes a good bus system? What’s so special about the ‘hotlines’?

The following analysis aim to find out what routes are popular, why are they popular, and how they have changed in a micro perspective. Kmeans Cluster Analysis was used to separate the disaggregated data into groups. Kmeans is an unsupervised learning algorithm that automatically group the dataset based on the distribution of each feature. We intend to use this algorithm to see if the resulting grouping identifies the hotlines, i.e. the routes that have higher ridership.

We looked at the Kmeans analysis both before and after the Cap Remap. We get 4 lines labeled as hotlines before the remap, 6 lines labeled as hotlines after the remap. The hotlines before and after the remap are plotted below. Most of the hot routes are north-south direction. There are two new hotlines emerged after the CapRemap, line 10 and line 20, and they are colored in red.

To dive deeper into the characteristics of the hot bus lines, we map out the passenger load for each route at each stop for each direction. We also ploted the passenger load versus stop sequence ID as well as average boarding and alighting at each stop along each route. The purpose of this analysis is to first, find out what is so special about the hotlines, and second, see trends before and after the Cap Remap. Note that the Austin bus system has different patterns for each route, and in order to make sure the plots to make sense, we only selected the most used pattern for each plot. Below we chose two Line 20 (type High Frequency) and Line 801(Metro Rapid) to demonstrate detailed route analysis.

Below is the analysis for Line 801.

By mapping and plotting the average passenger number on bus as well as the average boarding and alighting at each stop, we can see better how specific location or neighborhood could potentially contribute to the ridership. These regions will be feature engineered in the following analysis. We also noticed that ridership tends to be higher in the middle portion of the trip, this means a lot of the passengers board from early stops to stops near the ends.

In conclusion, hotlines have the following characteristics:

  • In terms of bus route types, Local, MetroRapid, and High Frequency routes have high ridership
  • In terms of geographical distribution:
  • Go through Hubs (UT, DT, Pleasant Valley)
  • Mostly North-South direction (Following the shape / geography of the city)
  • Going across the a large portion of the city
  • In terms of temporal trend, we know that more Shifts were added in the day time and rush hours, which might increase ridership.

3. Modeling

3.1 Strategies

We will be creating a machine learning model that predicts the ridership at each stop. Specifically, we will look at how Linear, Lasso and Ridge regression, Random Forest, and Xgboost captures the variability in our dataset. We will start with feature engineering, which consists of 5 major categories: amenity, built environment, demographics, route network, stop characteristics (internal data). The hypothesis is that these five categories will influence the ridership at each stop in different ways. The dependent variable we used is the average ridership for each stop in 2019. We use 2019 data because we want to focus on the data after Cap Remap, and our feature engineering aligns with the year 2019 better.

3.2 Feature Engineering

The feature table below demonstrates five types of features and the sources of each feature. All data comes from the following sources: APC aggregated and disaggregated data, Capital Metro, OpenStreetMap (OSM), Open Data Austin, and ACS Census.

For amenity, a buffer of 1/4 miles around each station is created. The number of each type of amenity within the buffer is calculated. In order to capture the distance factor related between stops and amenities, the distance between the stop and the closest 3 amenities are calculated as well.

For the built environment, some features are spatially joined with the stops, such as the neighborhood and the school district data. The area percentage of each type of land use within the buffer is calculated as well. The building density is approximated as a function of building lot area times the building height.

For demographics, we used Areal Weighted Interpolation and joined the weighted census estimates within the buffer to each stop.

For route information, the percentage of each route type passing each bus stop is calculated. The number of shifts going through each stop on a given week is also calculated.

For the internal (stop characteristics) category, we first added a transit hub dummy defined by Capital Metro; we then calculated the spatial lag, which is the average ridership of the surrounding stops.

A series of analysis is conducted to see what features are important. We first looked at the correlation between all features and the dependent variable, the mean ridership in 2019. Below are selected features that highly correlate with ridership positively and negatively. We found that the route information, amenity distance, and stop characteristics often correlates highly with ridership.

A correlation matrix is made to see potential collinearity between all features as well as the dependent variable. We found that certain features correlate highly, such as route direction feature SouthNorth versus WestEast, land use feature commercial versus residential, amenity feature distance to CBD versus distance to train stations. It is important to identify these variables when using features in the model.

3.3 Results and Validation

With the features created in five categories, there are four types of models built: simple linear regression model (lm in the following visualizations), lasso and ridge regression model (glmnet in the following visualizations); random forest model (rf in the following visualizations) and xgboost model (xgb in the following visualizations). As mentioned in the exploratory analysis, neighborhood has been contributing a huge impact on ridership as downtown and UT influence the ridership pattern a lot by its natural built environments or school schedules. Likewise, school districts is another critical impact on ridership. Thus in the modeling validation section, neighborhood and school district are used to test the model’s generalizability.

3.3.1 Neighborhood Fixed Effect

The MAPE and MAE of the four models reveals that, genrally speaking, random forest model is the model with the best accuracy while lasso and ridge model gives prediction less accurately than other three. Lasso and ridge regression has the largest MAPE and MAE while random forest model has the least MAPE and MAE.

In terms of predicted value and actual values, simple linear regression tends to underpredict ridership when the actual ridership is higher than 300.Lasso and ridge model generally overpredict ridership. Random forest model tends to overpredict ridership when the actual ridership is over 250. Xgboost model generally overpredict ridership but performs better than glmnet model when the actual ridership is low.

In order to test the the generaliability of the models on neighborhoods, the following bar chart reveals each model’s MAE in the neighborhoods. The charts demonstrate most of the MAEs are below 100. There are several neighborhoods appear to have a higher MAE among which most of them gathered around UT.

To take a closer look at the neighborhoods, maps of MAPE are plotted to show the generalizability. As mentioned before, the model’s acuuracy is lower in the neighborhoods around University of Texas.

3.3.2 School District Fixed Effect

This drives us to consider the model’s generalizability in school district since UT has been a big factor. According to the MAPE and MAE of the four models, there is a similar trend that lasso and ridge regression is not as accurate as other three models, and random forest model has the best overall accuracy.

The predicting power of each model is also similar to the last one. Simple linear regression model tends to underpredict ridership when the actual ridership is higher than 300.Lasso and Ridge model generally overpredict ridership.Randome forest model tends to overpredict ridership more, when the ridership is higher. Xgboost model generally overpredict ridership but performs better than glmnet model when the actual ridership is low.

Then the chart showing the MAE in different school districts reveals a general MAE below 100 in most school districts. However, in school district 5, where UT is located, the MAE is higher than the rest, which corresponds to our previous findings that, the generalizability of the model is not as wel around UT areas.

As shown in the maps, the darkest blue represents the largest MAPE appear in school district 5. The model’s accuracy still needs to be improved around UT areas.

4. Next Steps

With all of the findings from the modeling results and validation, there are several steps that we are considering to take, in order to improve the model’s generalizability around UT area. - Using ensemble - Tune hyperparameters - Further feature engineering - Use different buffer radius

5. Appendix

We group the disaggregated data by routes, and calculated the max and mean number of passengers on bus at each stop, the average miles traveled and the average hours spent for each passenger at each stop, as well as the total run length and total run time of the route, which bacame the features we used for KMeanse Analysis.

Setup

mapTheme <- function(base_size = 12) {
  theme(
    text = element_text( color = "black"),
    plot.title = element_text(size = 14,colour = "black"),
    plot.subtitle=element_text(face="italic"),
    plot.caption=element_text(hjust=0),
    axis.ticks = element_blank(),
    panel.background = element_blank(),axis.title = element_blank(),
    axis.text = element_blank(),
    axis.title.x = element_blank(),
    axis.title.y = element_blank(),
    panel.grid.minor = element_blank(),
    panel.border = element_rect(colour = "black", fill=NA, size=2)
  )
}

qBr <- function(df, variable, rnd) {
  if (missing(rnd)) {
    as.character(quantile(round(df[[variable]],0),
                          c(.01,.2,.4,.6,.8), na.rm=T))
  } else if (rnd == FALSE | rnd == F) {
    as.character(formatC(quantile(df[[variable]]), digits = 3),
                 c(.01,.2,.4,.6,.8), na.rm=T)
  }
}

q5 <- function(variable) {as.factor(ntile(variable, 5))}

plotTheme <- theme(
  plot.title =element_text(size=12),
  plot.subtitle = element_text(size=8),
  plot.caption = element_text(size = 6),
  axis.text.x = element_text(size = 10, angle = 45, hjust = 1),
  axis.text.y = element_text(size = 10),
  axis.title.y = element_text(size = 10),
  # Set the entire chart region to blank
  panel.background=element_blank(),
  plot.background=element_blank(),
  axis.ticks=element_blank())

plotTheme <- function(base_size = 12) {
  theme(
    text = element_text( color = "black"),
    plot.title = element_text(size = 14,colour = "black"),
    plot.subtitle = element_text(face="italic"),
    plot.caption = element_text(hjust=0),
    axis.ticks = element_blank(),
    panel.background = element_blank(),
    panel.grid.major = element_line("grey80", size = 0.1),
    panel.grid.minor = element_blank(),
    panel.border = element_rect(colour = "black", fill=NA, size=1.5),
    strip.background = element_rect(fill = "grey80", color = "white"),
    strip.text = element_text(size=12),
    axis.title = element_text(size=12),
    axis.text = element_text(size=10),
    plot.background = element_blank(),
    legend.background = element_blank(),
    legend.title = element_text(colour = "black", face = "italic"),
    legend.text = element_text(colour = "black", face = "italic"),
    strip.text.x = element_text(size = 14)
  )
}

nn_function <- function(measureFrom,measureTo,k) {
  
  nn <-   
    FNN::get.knnx(measureTo, measureFrom, k)$nn.dist
  
  output <-
    as.data.frame(nn) %>%
    rownames_to_column(var = "thisPoint") %>%
    gather(points, point_distance, V1:ncol(.)) %>%
    arrange(as.numeric(thisPoint)) %>%
    group_by(thisPoint) %>%
    summarize(pointDistance = mean(point_distance)) %>%
    arrange(as.numeric(thisPoint)) %>% 
    dplyr::select(-thisPoint)
  
  return(output)  
}
#####Data Structure#####
#We use aggregated data to look at the average ridership on weekdays at individual stops
ggplot()+
  geom_sf(data = serviceArea, aes(fill = "Service Areas"))+
  geom_sf(data = subset(serviceArea,NAME == "Austin"), aes(fill = "Austin"))+
  scale_fill_manual(values = c("Service Areas" = "gray25", "Austin" = "black"), name = NULL,
                    guide = guide_legend("Jurisdictions", override.aes = list(linetype = "blank", shape = NA)))+
  geom_sf(data = subset(agg_after_sf, STOP_ID == 476), aes(color = "Stop 476"), size = 2, show.legend = "point")+
  scale_colour_manual(values = c("Stop 476" = "darkorange"),
                      guide = guide_legend("Aggregated Data Example"))+
  labs(title = "Aggregated Data Structure",
       subtitle = "Data from Capital Metro")+
  ggrepel::geom_label_repel(
    data = subset(agg_after_sf, STOP_ID == 476),aes(label = "Average Ridership = 33 \n Average Passing Buses = 55", geometry = geometry),
    stat = "sf_coordinates",
    min.segment.length = 3)

#We use disaggregated data to investigate the average ridership on weekdays on different routes.
disagg_803 <- subset(disagg_sf, ROUTE == 803)%>%
  group_by(STOP_ID)%>%
  summarize(avg_on = mean(PSGR_ON),
            avg_load = mean(PSGR_LOAD))
ggplot()+
  geom_sf(data = serviceArea, aes(fill = "Service Areas"))+
  geom_sf(data = subset(serviceArea,NAME == "Austin"), aes(fill = "Austin"))+
  scale_fill_manual(values = c("Service Areas" = "gray25", "Austin" = "black"), name = NULL,
                    guide = guide_legend("Jurisdictions", override.aes = list(linetype = "blank", shape = NA)))+
  geom_sf(data = disagg_803, aes(color = "Stops on Route 803"), size = 2, show.legend = "point")+
  scale_colour_manual(values = c("Stops on Route 803" = "darkorange"),
                      guide = guide_legend("Disggregated Data Example"))+
  labs(title = "Disaggregated Data Structure",
       subtitle = "Data from Capital Metro")+
  geom_label_repel(
    data = subset(disagg_803, STOP_ID == 2606),aes(label = "Average On-board Passengers of Stop 2606 = 11 \n Route Type = Metro Rapid", geometry = geometry),
    stat = "sf_coordinates",
    min.segment.length = 0,
    segment.color = "lightgrey",
    point.padding = 20)

Changes on Route Types

local <- ggplot()+
  geom_sf(data = serviceArea, aes(fill = "Service Areas"))+
  geom_sf(data = subset(serviceArea, NAME == "Austin"), aes(fill = "Austin"))+
  scale_fill_manual(values = c("Service Areas" = "gray25", "Austin" = "black"), name = NULL,
                    guide = guide_legend("Jurisdictions", override.aes = list(linetype = "blank", shape = NA))) +
  geom_sf(data = subset(Routes1801, ROUTETYPE == "Local"), color = "lightblue2",lwd = 0.8,show.legend = FALSE)+
  geom_sf(data = subset(Routes2001, ROUTETYPE == "Local"), color = "lightblue2",lwd = 0.8,show.legend = FALSE)+
  #scale_colour_manual(values = c("Before Cap Remap" = "lightblue2", "After Cap Remap" = "lightblue2"),
                      #guide = guide_legend("Routes", override.aes = list(linetype = c("solid", "solid"))))+
  facet_grid(~capremap)+
  labs(title = "Local Routes Before and After Cap Remap")

#HighFrequency
highFrequency <- ggplot()+
  geom_sf(data = serviceArea, aes(fill = "Service Areas"))+
  geom_sf(data = subset(serviceArea, NAME == "Austin"), aes(fill = "Austin"))+
  scale_fill_manual(values = c("Service Areas" = "gray25", "Austin" = "black"), name = NULL,
                    guide = guide_legend("Jurisdictions", override.aes = list(linetype = "blank", shape = NA))) +
  geom_sf(data = subset(Routes1801, ROUTETYPE == "High Frequency"), color = "dodgerblue",lwd = 0.8,show.legend = FALSE)+
  geom_sf(data = subset(Routes2001, ROUTETYPE == "High Frequency"), color = "dodgerblue",lwd = 0.8,show.legend = FALSE)+
  #scale_colour_manual(values = c("Before Cap Remap" = "dodgerblue", "After Cap Remap" = "dodgerblue"),
                      #guide = guide_legend("Routes", override.aes = list(linetype = c("solid", "solid"))))+
  facet_grid(~capremap)+
  labs(title = "High Frequency Routes Before and After Cap Remap")

#major changes grid arrange
grid.arrange(local, highFrequency, ncol = 1)
#Crosstown
crosstown <-ggplot()+
  geom_sf(data = serviceArea, aes(fill = "Service Areas"))+
  geom_sf(data = subset(serviceArea, NAME == "Austin"), aes(fill = "Austin"))+
  scale_fill_manual(values = c("Service Areas" = "gray25", "Austin" = "black"), name = NULL,
                    guide = guide_legend("Jurisdictions", override.aes = list(linetype = "blank", shape = NA))) +
  geom_sf(data = subset(Routes1801, ROUTETYPE == "Crosstown"), color = "greenyellow",lwd = 0.8,show.legend = FALSE)+
  geom_sf(data = subset(Routes2001, ROUTETYPE == "Crosstown"), color = "greenyellow",lwd = 0.8,show.legend = FALSE)+
  #scale_colour_manual(values = c("Before Cap Remap" = "greenyellow", "After Cap Remap" = "greenyellow"),
                      #guide = guide_legend("Routes", override.aes = list(linetype = c("solid", "solid"))))+
  facet_grid(~capremap)+
  labs(title = "Crosstown Routes Before and After Cap Remap")

#Feeder
feeder <-ggplot()+
  geom_sf(data = serviceArea, aes(fill = "Service Areas"))+
  geom_sf(data = subset(serviceArea, NAME == "Austin"), aes(fill = "Austin"))+
  scale_fill_manual(values = c("Service Areas" = "gray25", "Austin" = "black"), name = NULL,
                    guide = guide_legend("Jurisdictions", override.aes = list(linetype = "blank", shape = NA))) +
  geom_sf(data = subset(Routes1801, ROUTETYPE == "Feeder"), color = "lightcoral",lwd = 0.8, show.legend = FALSE)+
  geom_sf(data = subset(Routes2001, ROUTETYPE == "Feeder"), color = "lightcoral",lwd = 0.8, show.legend = FALSE)+
  #scale_colour_manual(values = c("Before Cap Remap" = "lightcoral", "After Cap Remap" = "lightcoral"))+
                      #guide = guide_legend("Routes", override.aes = list(linetype = c("solid", "solid"))))+
  facet_grid(~capremap)+
  labs(title = "Feeder Routes Before and After Cap Remap")


#Flyer
flyer <- ggplot()+
  geom_sf(data = serviceArea, aes(fill = "Service Areas"))+
  geom_sf(data = subset(serviceArea, NAME == "Austin"), aes(fill = "Austin"))+
  scale_fill_manual(values = c("Service Areas" = "gray25", "Austin" = "black"), name = NULL,
                    guide = guide_legend("Jurisdictions", override.aes = list(linetype = "blank", shape = NA))) +
  geom_sf(data = subset(Routes1801, ROUTETYPE == "Flyer"), color = "magenta2",lwd = 0.8,show.legend = FALSE)+
  geom_sf(data = subset(Routes2001, ROUTETYPE == "Flyer"), color = "magenta2",lwd = 0.8,show.legend = FALSE)+
  #scale_colour_manual(values = c("Before Cap Remap" = "magenta2", "After Cap Remap" = "magenta2"),
                     # guide = guide_legend("Routes", override.aes = list(linetype = c("solid", "solid"))))+
  facet_grid(~capremap)+
  labs(title = "Flyer Routes Before and After Cap Remap")

#Express
express <-ggplot()+
  geom_sf(data = serviceArea, aes(fill = "Service Areas"))+
  geom_sf(data = subset(serviceArea, NAME == "Austin"), aes(fill = "Austin"))+
  scale_fill_manual(values = c("Service Areas" = "gray25", "Austin" = "black"), name = NULL,
                    guide = guide_legend("Jurisdictions", override.aes = list(linetype = "blank", shape = NA))) +
  geom_sf(data = subset(Routes1801, ROUTETYPE == "Express"), color = "red",lwd = 0.8,show.legend = FALSE)+
  geom_sf(data = subset(Routes2001, ROUTETYPE == "Express"), color = "red",lwd = 0.8,show.legend = FALSE)+
  #scale_colour_manual(values = c("Before Cap Remap" = "red", "After Cap Remap" = "red"),
                      #guide = guide_legend("Routes", override.aes = list(linetype = c("solid", "solid"))))+
  facet_grid(~capremap)+
  labs(title = "Express Routes Before and After Cap Remap")

#Special
special <- ggplot()+
  geom_sf(data = serviceArea, aes(fill = "Service Areas"))+
  geom_sf(data = subset(serviceArea, NAME == "Austin"), aes(fill = "Austin"))+
  scale_fill_manual(values = c("Service Areas" = "gray25", "Austin" = "black"), name = NULL,
                    guide = guide_legend("Jurisdictions", override.aes = list(linetype = "blank", shape = NA))) +
  geom_sf(data = subset(Routes1801, ROUTETYPE == "Special"), color = "seashell2",lwd = 0.8,show.legend = FALSE)+
  geom_sf(data = subset(Routes2001, ROUTETYPE == "Special"), color = "seashell2",lwd = 0.8,show.legend = FALSE)+
  #scale_colour_manual(values = c("Before Cap Remap" = "seashell2", "After Cap Remap" = "seashell2"),
   #                   guide = guide_legend("Routes", override.aes = list(linetype = c("solid", "solid"))))+
  facet_grid(~capremap)+
  labs(title = "Speical Routes Before and After Cap Remap")

#minor changes grid arrange

grid.arrange(crosstown, feeder, flyer, express, ncol =2)
#UT Shuttle
UTShuttle <- ggplot()+
  geom_sf(data = serviceArea, aes(fill = "Service Areas"))+
  geom_sf(data = subset(serviceArea, NAME == "Austin"), aes(fill = "Austin"))+
  scale_fill_manual(values = c("Service Areas" = "gray25", "Austin" = "black"), name = NULL,
                    guide = guide_legend("Jurisdictions", override.aes = list(linetype = "blank", shape = NA))) +
  geom_sf(data = subset(Routes1801, ROUTETYPE == "UT Shuttle"), color = "orange",lwd = 0.8,show.legend = FALSE)+
  geom_sf(data = subset(Routes2001, ROUTETYPE == "UT Shuttle"), color = "orange",lwd = 0.8,show.legend = FALSE)+
  #scale_colour_manual(values = c("Before Cap Remap" = "orange", "After Cap Remap" = "orange"),
                      #guide = guide_legend("Routes", override.aes = list(linetype = c("solid", "solid"))))+
  facet_grid(~capremap)+
  labs(title = "UT Shuttle Before and After Cap Remap")

#Nigh Owl
nightowl <- ggplot()+
  geom_sf(data = serviceArea, aes(fill = "Service Areas"))+
  geom_sf(data = subset(serviceArea, NAME == "Austin"), aes(fill = "Austin"))+
  scale_fill_manual(values = c("Service Areas" = "gray25", "Austin" = "black"), name = NULL,
                    guide = guide_legend("Jurisdictions", override.aes = list(linetype = "blank", shape = NA))) +
  geom_sf(data = subset(Routes1801, ROUTETYPE == "Night Owl"), color = "slategray2",lwd = 0.8,show.legend = FALSE)+
  geom_sf(data = subset(Routes2001, ROUTETYPE == "Night Owl"), color = "slategray2",lwd = 0.8,show.legend = FALSE)+
  #scale_colour_manual(values = c("Before Cap Remap" = "slategray2", "After Cap Remap" = "slategray2"),
   #                   guide = guide_legend("Routes", override.aes = list(linetype = c("solid", "solid"))))+
  facet_grid(~capremap)+
  labs(title = "Nigh Owl Routes Before and After Cap Remap")

#no change grid arrange
grid.arrange(UTShuttle, special, nightowl, ncol = 2)

Exploratory Analysis

How did ridership change before and after CapRemap (06/03, 2018)?

Ridership Change in Different Neighborhoods in Austin in 2018

Identify the Hotlines

First, let us look at the Kmeans analysis before the CapRemap. We group the disaggregated data by routes, and calculated the max and mean number of passengers on bus at each stop, the average miles traveled and the average hours spent for each passenger at each stop, as well as the total run length and total run time of the route.

Then, we apply Kmeans analysis. The number of clusters are determined by both the elbow chart and the 26 criteria provided by the NbClus package. For more information, see appendix.

We do the same analysis to the disaggregated dataset after the CapRemap.

These clustering labels are joined to the original dataset. For more about the clustering result, please see appendix.

Find the number of kmeans clusters for both before and after the CapRemap:

Both the Elbow chart and the 26 indicies provided by the NbClust package are used to check how many clusters should be used in the Kmeans analysis.

Before CapRemap:

After CapRemap:

In either case, it is evident that the most optimal number for the Kmeans cluster analysis is 3. We then conduct Kmeans analysis with 3 clusters as mentioned above in the exploratory analysis section.

Here is the Kmeans analysis result we got for before and after the CapRemap. The numbers are average of each feature used in the Kmeans analysis. We can clearly see that cluster 2 for both before and after the remapping have the highest average ridership as well as run times. They also have the smallest size. We can conclude that these are the most popular routes and we then define these routes as ‘hotlines’.

routeplot1 <- function(n,p,p1,d) {
  # line n before map
  t1 = ggplot() +
  geom_sf(data = nhood, color = 'grey30',fill = 'grey20') +
  geom_sf(data = disagn1j %>% filter(ROUTE == n & DIRECTION ==d & PATTERN== p) %>%
            st_as_sf(coords = c("LONGITUDE", "LATITUDE"), crs = 4326, agr = "constant") %>%
            st_transform(2278) %>%
            group_by(STOP_ID) %>%
            summarise(mean_stop_load = mean(PSGR_LOAD),size = 0.8), 
          aes(color = mean_stop_load))+
  scale_color_gradientn(colors = c("#0c2c84","#41b6c4", "#ffffcc"), limits = c(0,25), 
                       breaks = c(0, 5, 10, 15, 20, 25)) +
  labs(title=paste("Line",n,"Direction 1, Before CapRemap"),
  subtitle = "Average Number of\nPassengers at Each Stop")+mapTheme()
  
  #line n before passenger load chart
  t11 = ggplot(data = disagn1j %>% filter(ROUTE == n & DIRECTION ==d & PATTERN == p) %>%
         group_by(STOP_SEQ_ID) %>%
         summarise(mean_on = mean(PSGR_ON), mean_off = mean(PSGR_OFF), 
                   mean_load = mean(PSGR_LOAD))) +
    geom_path(aes(x = STOP_SEQ_ID, y = mean_load, 
                size = mean_load, color = mean_load), lineend="round",linejoin="mitre")+
    scale_color_gradientn(colors = c("#0c2c84","#41b6c4", "#ffffcc"), limits = c(0,25), 
                       breaks = c(0, 5, 10, 15, 20, 25))+
    scale_size_continuous()+
    ylim(0, 23) +
    labs(subtitle=paste("Average Passenger Load"))+plotTheme()+ 
    theme(legend.position="none")
  
  #line n before passenger boarding and alighting
  t12 = ggplot(data = disagn1j %>% filter(ROUTE == n & DIRECTION ==d & PATTERN == p) %>%
         group_by(STOP_SEQ_ID) %>%
         summarise(mean_on = mean(PSGR_ON), mean_off = mean(PSGR_OFF), 
                   mean_load = mean(PSGR_LOAD))) +
    geom_ribbon(aes(x = STOP_SEQ_ID,ymin=0,ymax=mean_on), fill="#9999CC", alpha="0.25") +
    geom_ribbon(aes(x = STOP_SEQ_ID,ymin=0,ymax=mean_off), fill="#9999CC", alpha="0.25") +
    geom_line(aes(x = STOP_SEQ_ID, y = mean_on, color = "Average Boarding"), size=1) + 
    geom_line(aes(x = STOP_SEQ_ID, y = mean_off, color = "Average Alighting"), size=1)+ 
    ylim(0, 10) +
    labs(subtitle=paste("Average Boarding/Alighting"))+plotTheme()+ 
    theme(legend.position="bottom", legend.box = "horizontal")
  
  # line n after map
  t2 = ggplot() +
  geom_sf(data = nhood, color = 'grey30',fill = 'grey20') +
  geom_sf(data = disagn2j %>% filter(ROUTE == n & DIRECTION ==d & PATTERN== p1) %>%
            st_as_sf(coords = c("LONGITUDE", "LATITUDE"), crs = 4326, agr = "constant") %>%
            st_transform(2278) %>%
            group_by(STOP_ID) %>%
            summarise(mean_stop_load = mean(PSGR_LOAD),size = 0.8), 
          aes(color = mean_stop_load))+
  scale_color_gradientn(colors = c("#0c2c84","#41b6c4", "#ffffcc"), limits = c(0,25), 
                       breaks = c(0, 5, 10, 15, 20, 25)) +
  labs(title=paste("Line",n,"Direction 1, After CapRemap"),
  subtitle = "Average Number of\nPassengers at Each Stop")+mapTheme()
  
  #line n after passenger load chart
  t21 = ggplot(data = disagn2j %>% filter(ROUTE == n & DIRECTION ==d & PATTERN == p1) %>%
         group_by(STOP_SEQ_ID) %>%
         summarise(mean_on = mean(PSGR_ON), mean_off = mean(PSGR_OFF), 
                   mean_load = mean(PSGR_LOAD))) +
    geom_path(aes(x = STOP_SEQ_ID, y = mean_load, 
                size = mean_load, color = mean_load), lineend="round",linejoin="mitre")+
    scale_color_gradientn(colors = c("#0c2c84","#41b6c4", "#ffffcc"), limits = c(0,25), 
                       breaks = c(0, 5, 10, 15, 20, 25))+
    scale_size_continuous()+
    ylim(0, 23) +
    labs(subtitle=paste("Average Passenger Load"))+plotTheme()+ 
    theme(legend.position="none")
  
  #line n after passenger boarding and alighting
  t22 = ggplot(data = disagn2j %>% filter(ROUTE == n & DIRECTION ==d & PATTERN == p1) %>%
         group_by(STOP_SEQ_ID) %>%
         summarise(mean_on = mean(PSGR_ON), mean_off = mean(PSGR_OFF), 
                   mean_load = mean(PSGR_LOAD))) +
    geom_ribbon(aes(x = STOP_SEQ_ID,ymin=0,ymax=mean_on), fill="#9999CC", alpha="0.25") +
    geom_ribbon(aes(x = STOP_SEQ_ID,ymin=0,ymax=mean_off), fill="#9999CC", alpha="0.25") +
    geom_line(aes(x = STOP_SEQ_ID, y = mean_on, color = "Average Boarding"), size=1) + 
    geom_line(aes(x = STOP_SEQ_ID, y = mean_off, color = "Average Alighting"), size=1)+ 
    ylim(0, 10) +
    labs(subtitle=paste("Average Boarding/Alighting"))+plotTheme()+ 
    theme(legend.position="bottom", legend.box = "horizontal")
  
  grid.arrange(arrangeGrob(t1, t11, t12, ncol = 1, nrow = 3),
               arrangeGrob(t2, t21, t22, ncol = 1, nrow = 3),ncol=2)
}

Feature Engineering

#####census#####
options(tigris_use_cache = TRUE)
v17 <- load_variables(2017, "acs5", cache = TRUE)

Hays <- get_acs(state = "48", county = "209", geography = "tract", 
                variables = "B01001_001", geometry = TRUE)
Travis <- get_acs(state = "48", county = "453", geography = "tract", 
                  variables = "B01001_001", geometry = TRUE)
Williamson <- get_acs(state = "48", county = "491", geography = "tract", 
                      variables = "B01001_001", geometry = TRUE) 

Travis_race <- get_acs(state = "48", county = "453", geography = "tract", 
                       variables = "B02001_002", geometry = TRUE)
Williamson_race <- get_acs(state = "48", county = "491", geography = "tract", 
                           variables = "B02001_002", geometry = TRUE) 

Travis_noveh <- get_acs(state = "48", county = "453", geography = "tract", 
                        variables = "B08014_002", geometry = TRUE)
Williamson_noveh <- get_acs(state = "48", county = "491", geography = "tract", 
                            variables = "B08014_002", geometry = TRUE)

Travis_oneveh <- get_acs(state = "48", county = "453", geography = "tract", 
                        variables = "B08014_003", geometry = TRUE)
Williamson_oneveh <- get_acs(state = "48", county = "491", geography = "tract", 
                            variables = "B08014_003", geometry = TRUE)

Travis_twoveh <- get_acs(state = "48", county = "453", geography = "tract", 
                         variables = "B08014_004", geometry = TRUE)
Williamson_twoveh <- get_acs(state = "48", county = "491", geography = "tract", 
                             variables = "B08014_004", geometry = TRUE)

Travis_threeveh <- get_acs(state = "48", county = "453", geography = "tract", 
                         variables = "B08014_005", geometry = TRUE)
Williamson_threeveh <- get_acs(state = "48", county = "491", geography = "tract", 
                             variables = "B08014_005", geometry = TRUE)

Travis_fourveh <- get_acs(state = "48", county = "453", geography = "tract", 
                           variables = "B08014_006", geometry = TRUE)
Williamson_fourveh <- get_acs(state = "48", county = "491", geography = "tract", 
                               variables = "B08014_006", geometry = TRUE)

Travis_fiveveh <- get_acs(state = "48", county = "453", geography = "tract", 
                          variables = "B08014_007", geometry = TRUE)
Williamson_fiveveh <- get_acs(state = "48", county = "491", geography = "tract", 
                              variables = "B08014_007", geometry = TRUE)

Travis_poverty <- get_acs(state = "48", county = "453", geography = "tract", 
                          variables = "B06012_002", geometry = TRUE)
Williamson_poverty <- get_acs(state = "48", county = "491", geography = "tract", 
                              variables = "B06012_002", geometry = TRUE)
Race <- rbind(Travis_race, Williamson_race)%>%
  st_transform(2278)

Race_buff <- aw_interpolate(StopBuff, tid = STOP_ID, source = Race, sid = GEOID, weight = "sum",
                                  output = "sf", extensive = "estimate")
Race_buff$race <- round(Race_buff$estimate)

NoVeh <- rbind(Travis_noveh, Williamson_noveh)%>%
  st_transform(2278)

NoVeh_buff <- aw_interpolate(StopBuff, tid = STOP_ID, source = NoVeh, sid = GEOID, weight = "sum",
                            output = "sf", extensive = "estimate")
NoVeh_buff$estimate <- round(NoVeh_buff$estimate)

OneVeh <- rbind(Travis_oneveh, Williamson_oneveh)%>%
  st_transform(2278)

OneVeh_buff <- aw_interpolate(StopBuff, tid = STOP_ID, source = OneVeh, sid = GEOID, weight = "sum",
                             output = "sf", extensive = "estimate")
OneVeh_buff$estimate <- round(OneVeh_buff$estimate)

TwoVeh <- rbind(Travis_twoveh, Williamson_twoveh)%>%
  st_transform(2278)

TwoVeh_buff <- aw_interpolate(StopBuff, tid = STOP_ID, source = TwoVeh, sid = GEOID, weight = "sum",
                              output = "sf", extensive = "estimate")
TwoVeh_buff$estimate <- round(TwoVeh_buff$estimate)

ThreeVeh <- rbind(Travis_threeveh, Williamson_threeveh)%>%
  st_transform(2278)

ThreeVeh_buff <- aw_interpolate(StopBuff, tid = STOP_ID, source = ThreeVeh, sid = GEOID, weight = "sum",
                              output = "sf", extensive = "estimate")
ThreeVeh_buff$estimate <- round(ThreeVeh_buff$estimate)

FourVeh <- rbind(Travis_fourveh, Williamson_fourveh)%>%
  st_transform(2278)

FourVeh_buff <- aw_interpolate(StopBuff, tid = STOP_ID, source = FourVeh, sid = GEOID, weight = "sum",
                                output = "sf", extensive = "estimate")
FourVeh_buff$estimate <- round(FourVeh_buff$estimate)

FiveVeh <- rbind(Travis_fiveveh, Williamson_fiveveh)%>%
  st_transform(2278)

FiveVeh_buff <- aw_interpolate(StopBuff, tid = STOP_ID, source = FiveVeh, sid = GEOID, weight = "sum",
                               output = "sf", extensive = "estimate")
FiveVeh_buff$estimate <- round(FiveVeh_buff$estimate)

Poverty <- rbind(Travis_poverty, Williamson_poverty)%>%
  st_transform(2278)

Poverty_buff <- aw_interpolate(StopBuff, tid = STOP_ID, source = Poverty, sid = GEOID, weight = "sum",
                               output = "sf", extensive = "estimate")
Poverty_buff$estimate <- round(Poverty_buff$estimate)
#Building Area Feature Engineering
#Create the polygon buffer function
bufferPoly <- function(Buffer, Polygons, Name){
  if(class(Polygons$geometry) == "sfc_POLYGON"){
    Poly <- st_join(Buffer%>% select(STOP_ID), Polygons, join = st_intersects)%>%
      group_by(STOP_ID)%>%
      summarize(area = sum(Total_area))%>%
      rename(!!Name := area)
  }
  #else {
  #  Poly <- st_join(Buffer%>% select(STOP_ID), Polygons, join = st_intersects)%>%
  #    group_by(STOP_ID)%>%
  #    summarize(count = n())%>%
  #    rename(!!Name := count)
  #}
}

#Import building footprint shapefile
Buildings <- 
  st_read("C:/Upenn/Practicum/Data/building_footprints_2017/building_footprints_2017.shp")%>%
  st_transform(2278) 
#Import Stop shp
stops <-
  st_read("C:/Upenn/Practicum/Data/Shapefiles_-_JUNE_2018/Stops.shp") %>%
  st_transform(2278)

#Create columns of the num of floors and total building areas
Buildings$Floor <- round(Buildings$MAX_HEIGHT/10)
Buildings$Total_area <- Buildings$Floor * Buildings$SHAPE_AREA

AreaPoly <- bufferPoly(StopBuff, Buildings, 'building_area')
AreaPoly$geometry <- NULL

AreaPoly2 <- bufferPoly(StopBuff2, Buildings, 'building_area')

write.csv(AreaPoly, "C:/Upenn/Practicum/Data/Building_Area2.csv")

write.csv(AreaPoly2, "C:/Upenn/Practicum/Data/Building_Area2.csv")

?merge

Austin.sf <- merge(AreaPoly, data.2018, by= "STOP_ID", all.x=TRUE)

Austin.sf$geometry <- NULL
Austin.sf<- Austin.sf[-c(25158,25159,25160,25161,25162,25163,25164,35408,37475,37476,37477,37478,37479,37480,37481,39140,39153,39292,42526,42527,42528,42529,42530,42531,
                         42532,42533,42534,44292,44293,45249,45250,48882,48883,48915,48916), ]
which(is.na(Austin.sf$LONGITUDE))

Austin.sf <- 
  Austin.sf %>% 
  st_as_sf(coords = c("LONGITUDE", "LATITUDE"), crs = 4326, agr = "constant") %>%
  st_transform(2278)

library(wesanderson)
palette5 <- wes_palette("Moonrise3", n = 5)
palette5 <- 
  mapTheme <- function(base_size = 12) {
    theme(
      text = element_text( color = "black"),
      plot.title = element_text(size = 14,colour = "black"),
      plot.subtitle=element_text(face="italic"),
      plot.caption=element_text(hjust=0),
      axis.ticks = element_blank(),
      panel.background = element_blank(),axis.title = element_blank(),
      axis.text = element_blank(),
      axis.title.x = element_blank(),
      axis.title.y = element_blank(),
      panel.grid.minor = element_blank(),
      panel.border = element_rect(colour = "black", fill=NA, size=2)
    )
  }

ggplot() +
  geom_sf(data = nhood, fill = "grey40") +
  geom_sf(data = Austin.sf, aes(colour = q5(building_area)), show.legend = "point", size = .75) +
  scale_colour_manual(values = palette5,
                      labels=qBr(Austin.sf,"building_area"),
                      name="Quintile\nBreaks") +
  labs(title="Building Area within 1/4 Mile Buffer from each Stop") +
  mapTheme()

join all data

all_x1 = CommercialInit %>%  #amenities and route related
  left_join(st_drop_geometry(RetailInit), by = "STOP_ID") %>%
  left_join(st_drop_geometry(OfficeInit), by = "STOP_ID") %>%
  left_join(st_drop_geometry(ResidentialInit), by = "STOP_ID") %>%
  left_join(st_drop_geometry(SupermktInit), by = "STOP_ID") %>%
  left_join(st_drop_geometry(BarInit), by = "STOP_ID") %>%
  left_join(st_drop_geometry(UniInit), by = "STOP_ID") %>%
  left_join(st_drop_geometry(ParkingInit), by = "STOP_ID") %>%
  left_join(st_drop_geometry(SchoolInit), by = "STOP_ID") %>%
  left_join(st_drop_geometry(StationInit), by = "STOP_ID") %>%
  left_join(st_drop_geometry(StadiumInit), by = "STOP_ID") %>%
  left_join(stop_dir_freq, by = "STOP_ID") %>%
  left_join(stop_type_freq, by = "STOP_ID") %>%
  left_join(stop_hot_freq, by = "STOP_ID") %>%
  left_join(build_dens, by = "STOP_ID") %>%
  left_join(st_drop_geometry(stop_buff_landuse), by = "STOP_ID") %>%
  left_join(st_drop_geometry(Race_buff) %>% select(STOP_ID, race) %>% mutate(STOP_ID = as.numeric(STOP_ID), race = as.numeric(race)), by = "STOP_ID") %>% #census data
  left_join(st_drop_geometry(Population_buff) %>% select(STOP_ID, population) %>% mutate(STOP_ID = as.numeric(STOP_ID), population = as.numeric(population)), by = "STOP_ID") %>% 
  left_join(st_drop_geometry(NoVeh_buff) %>% select(STOP_ID, NoVeh) %>% mutate(STOP_ID = as.numeric(STOP_ID), NoVeh = as.numeric(NoVeh)), by = "STOP_ID") %>%
  left_join(st_drop_geometry(OneVeh_buff) %>% select(STOP_ID, OneVeh) %>% mutate(STOP_ID = as.numeric(STOP_ID), OneVeh = as.numeric(OneVeh)), by = "STOP_ID") %>%
  left_join(st_drop_geometry(TwoVeh_buff) %>% select(STOP_ID, TwoVeh) %>% mutate(STOP_ID = as.numeric(STOP_ID), TwoVeh = as.numeric(TwoVeh)), by = "STOP_ID") %>%
  left_join(st_drop_geometry(ThreeVeh_buff) %>% select(STOP_ID, ThreeVeh) %>% mutate(STOP_ID = as.numeric(STOP_ID), ThreeVeh = as.numeric(ThreeVeh)), by = "STOP_ID") %>%
  left_join(st_drop_geometry(FourVeh_buff) %>% select(STOP_ID, FourVeh) %>% mutate(STOP_ID = as.numeric(STOP_ID), FourVeh = as.numeric(FourVeh)), by = "STOP_ID") %>%
  left_join(st_drop_geometry(FiveVeh_buff) %>% select(STOP_ID, FiveVeh) %>% mutate(STOP_ID = as.numeric(STOP_ID), FiveVeh = as.numeric(FiveVeh)), by = "STOP_ID") %>%
  left_join(st_drop_geometry(Poverty_buff) %>% select(STOP_ID, Poverty) %>% mutate(STOP_ID = as.numeric(STOP_ID), Poverty = as.numeric(Poverty)), by = "STOP_ID") %>%
  left_join(st_drop_geometry(stop_nhood), by = "STOP_ID") %>% # fixed effects
  left_join(st_drop_geometry(stop_school), by = "STOP_ID") %>%
  select(-c(hotline_0, X)) %>%
  left_join(data.2019.mean, by = "STOP_ID")

#spatial lag, knn dist
all_x3 = bind_cols(list(all_x1, utaustinDist, CBDDist, commercialDist, retailDist, supermktDist, officeDist, residentialDist, schoolDist, universityDist, parkingDist, stadiumDist, trainstationDist, universityDist, airportDist, spatial_lag %>% select(spatial_lag2, spatial_lag3, spatial_lag5)))

#time lag

correlation plots

Modeling and Validation

Use Neighborhood For Validation

###################################Modelling part ends here, below are the visualizations##############
#MAPE chart
ggplot(data = val_preds %>% 
         dplyr::select(model, MAPE) %>% 
         distinct() , 
       aes(x = model, y = MAPE, group = 1)) +
  geom_path(color = "blue") +
  geom_label(aes(label = paste0(round(MAPE,1),"%"))) +
  labs(title= "MAPE of each model on the testing set")
  theme_bw()
#MAE chart
ggplot(data = val_preds %>% 
           dplyr::select(model, MAE) %>% 
           distinct() , 
         aes(x = model, y = MAE, group = 1)) +
    geom_path(color = "blue") +
    geom_label(aes(label = paste0(round(MAE,1)))) +
    labs(title= "MAE of each model on the testing set")
  theme_bw()  
  
#Predicted vs Observed
ggplot(val_preds, aes(x =.pred, y = mean_on, group = model)) +
  geom_point() +
  geom_abline(linetype = "dashed", color = "red") +
  geom_smooth(method = "lm", color = "blue") +
  coord_equal() +
  facet_wrap(~model, nrow = 2) +
  labs(title="Predicted vs Observed on the testing set", subtitle= "blue line is predicted value") 
  theme_bw()

#Neighborhood validation
val_MAPE_by_hood <- val_preds %>% 
  group_by(label, model) %>% 
  summarise(RMSE = yardstick::rmse_vec(mean_on, .pred),
            MAE  = yardstick::mae_vec(mean_on, .pred),
            MAPE = yardstick::mape_vec(mean_on, .pred)) %>% 
  ungroup() 
#Barchart of the MAE in each neighborhood
palette4 <- c("#eff3ff", "#bdd7e7","#6baed6","#2171b5")


val_MAPE_by_hood %>%
  dplyr::select(label, model, MAE) %>%
  gather(Variable, MAE, -model, -label) %>%
  ggplot(aes(label, MAE)) + 
  geom_bar(aes(fill = model), position = "dodge", stat="identity") +
  scale_fill_manual(values = "Blues") +
  facet_wrap(~label,scales="free", ncol=6)+
  labs(title = "Mean Absolute Errors by model specification and neighborhood") +
  plotTheme()

#Map of MAE in each neighborhood
#Add geometry to the MAE
MAE.nhood <- merge(nhood, val_MAPE_by_hood, by.x="label", by.y="label", all.y=TRUE)

#Produce the map

#Map: MAPE of lm
MAE.nhood%>%
  filter(model=="lm") %>%
  ggplot() +
  #    geom_sf(data = nhoods, fill = "grey40") +
  geom_sf(aes(fill = q5(MAPE))) +
  scale_fill_brewer(palette = "Blues",
                    aesthetics = "fill",
                    labels=qBr(MAE.nhood,"MAPE"),
                    name="Quintile\nBreaks, (%)") +
  labs(title="MAPE of lm in Neighborhoods") +
  mapTheme()

#Map: MAPE of glmnet
MAE.nhood%>%
  filter(model=="glmnet") %>%
  ggplot() +
  #    geom_sf(data = nhoods, fill = "grey40") +
  geom_sf(aes(fill = q5(MAPE))) +
  scale_fill_brewer(palette = "Blues",
                    aesthetics = "fill",
                    labels=qBr(MAE.nhood,"MAPE"),
                    name="Quintile\nBreaks, (%)") +
  labs(title="MAPE of glmnet in Neighborhoods") +
  mapTheme()
#MAPE of rf
MAE.nhood%>%
  filter(model=="rf") %>%
  ggplot() +
  #    geom_sf(data = nhoods, fill = "grey40") +
  geom_sf(aes(fill = q5(MAPE))) +
  scale_fill_brewer(palette = "Blues",
                    aesthetics = "fill",
                    labels=qBr(MAE.nhood,"MAPE"),
                    name="Quintile\nBreaks, (%)") +
  labs(title="MAPE of rf in Neighborhoods") +
  mapTheme()

#MAPE of xgb
MAE.nhood%>%
  filter(model=="xgb") %>%
  ggplot() +
  #    geom_sf(data = nhoods, fill = "grey40") +
  geom_sf(aes(fill = q5(MAPE))) +
  scale_fill_brewer(palette = "Blues",
                    aesthetics = "fill",
                    labels=qBr(MAE.nhood,"MAPE"),
                    name="Quintile\nBreaks, (%)") +
  labs(title="MAPE of xgb in Neighborhoods") +
  mapTheme()

Use School District For Validation

####################################Model Validation
# Pull best hyperparam preds from out-of-fold predictions
lm_best_OOF_preds <- collect_predictions(lm_tuned) 
glmnet_best_OOF_preds <- collect_predictions(glmnet_tuned) %>% 
  filter(penalty  == glmnet_best_params$penalty[1] & mixture == glmnet_best_params$mixture[1])
rf_best_OOF_preds <- collect_predictions(rf_tuned) %>% 
  filter(mtry  == rf_best_params$mtry[1] & min_n == rf_best_params$min_n[1])

xgb_best_OOF_preds <- collect_predictions(xgb_tuned) %>% 
  filter(mtry  == xgb_best_params$mtry[1] & min_n == xgb_best_params$min_n[1])
# collect validation set predictions from last_fit model
lm_val_pred_geo     <- collect_predictions(lm_val_fit_geo)
glmnet_val_pred_geo <- collect_predictions(glmnet_val_fit_geo)
rf_val_pred_geo     <- collect_predictions(rf_val_fit_geo)
xgb_val_pred_geo    <- collect_predictions(xgb_val_fit_geo)
# Aggregate OOF predictions (they do not overlap with Validation prediction set)
OOF_preds <- rbind(data.frame(dplyr::select(lm_best_OOF_preds, .pred, mean_on), model = "lm"),
                   data.frame(dplyr::select(glmnet_best_OOF_preds, .pred, mean_on), model = "glmnet"),
                   data.frame(dplyr::select(rf_best_OOF_preds, .pred, mean_on), model = "RF"),
                   data.frame(dplyr::select(xgb_best_OOF_preds, .pred, mean_on), model = "xgb")) %>% 
  group_by(model) %>% 
  mutate(.pred = exp(.pred),
         mean_on = exp(mean_on),
         RMSE = yardstick::rmse_vec(mean_on, .pred),
         MAE  = yardstick::mae_vec(mean_on, .pred),
         MAPE = yardstick::mape_vec(mean_on, .pred)) %>% 
  ungroup()


# Aggregate predictions from Validation set
val_preds <- rbind(data.frame(lm_val_pred_geo, model = "lm"),
                   data.frame(glmnet_val_pred_geo, model = "glmnet"),
                   data.frame(rf_val_pred_geo, model = "rf"),
                   data.frame(xgb_val_pred_geo, model = "xgb")) %>% 
  left_join(., data %>% 
              rowid_to_column(var = ".row") %>% 
              dplyr::select(TRUSTEE, .row), 
            by = ".row") %>% 
  group_by(model) %>%
  mutate(.pred = exp(.pred),
         mean_on = exp(mean_on),
         RMSE = yardstick::rmse_vec(mean_on, .pred),
         MAE  = yardstick::mae_vec(mean_on, .pred),
         MAPE = yardstick::mape_vec(mean_on, .pred)) %>% 
  ungroup()
###################################Modelling part ends here, below are the visualizations##############
#MAPE chart
ggplot(data = val_preds %>% 
         dplyr::select(model, MAPE) %>% 
         distinct() , 
       aes(x = model, y = MAPE, group = 1)) +
  geom_path(color = "blue") +
  geom_label(aes(label = paste0(round(MAPE,1),"%"))) +
  labs(title= "MAPE of each model on the testing set")
theme_bw()
#MAE chart
ggplot(data = val_preds %>% 
         dplyr::select(model, MAE) %>% 
         distinct() , 
       aes(x = model, y = MAE, group = 1)) +
  geom_path(color = "blue") +
  geom_label(aes(label = paste0(round(MAE,1)))) +
  labs(title= "MAE of each model on the testing set")
theme_bw()  

#Predicted vs Observed
ggplot(val_preds, aes(x =.pred, y = mean_on, group = model)) +
  geom_point() +
  geom_abline(linetype = "dashed", color = "red") +
  geom_smooth(method = "lm", color = "blue") +
  coord_equal() +
  facet_wrap(~model, nrow = 2) +
  labs(title="Predicted vs Observed on the testing set", subtitle= "blue line is predicted value") 
theme_bw()

#Neighborhood validation
val_MAPE_by_district <- val_preds %>% 
  group_by(TRUSTEE, model) %>% 
  summarise(RMSE = yardstick::rmse_vec(mean_on, .pred),
            MAE  = yardstick::mae_vec(mean_on, .pred),
            MAPE = yardstick::mape_vec(mean_on, .pred)) %>% 
  ungroup() 
#Barchart of the MAE in each neighborhood
plotTheme <- function(base_size = 20) {
  theme(
    text = element_text( color = "black"),
    plot.title = element_text(size = 50,colour = "black"),
    plot.subtitle = element_text(face="italic"),
    plot.caption = element_text(hjust=0),
    axis.ticks = element_blank(),
    panel.background = element_blank(),
    panel.grid.major = element_line("grey80", size = 0.1),
    panel.grid.minor = element_blank(),
    panel.border = element_rect(colour = "black", fill=NA, size=2),
    strip.background = element_rect(fill = "grey80", color = "white"),
    strip.text = element_text(size=30),
    axis.title = element_text(size=30),
    axis.text = element_text(size=20),
    plot.background = element_blank(),
    legend.background = element_blank(),
    legend.title = element_text(colour = "black", face = "italic", size= 50),
    legend.text = element_text(colour = "black", face = "italic",size = 50),
    strip.text.x = element_text(size = 30)
  )
}


val_MAPE_by_district %>%
  dplyr::select(TRUSTEE, model, MAE) %>%
  gather(Variable, MAE, -model, -TRUSTEE) %>%
  ggplot(aes(TRUSTEE, MAE)) + 
  geom_bar(aes(fill = model), position = "dodge", stat="identity") +
  scale_fill_manual(values = palette4) +
  facet_wrap(~TRUSTEE,scales="free", ncol=8)+
  ylim(0,200)+
  labs(title = "Mean Absolute Errors by model specification and school districts") +
  plotTheme()

#Map of MAE in each school district
#Load school district data
TrusteeDist <- 
  st_read("C:/Upenn/Practicum/Data/Trustee_Boundaries/Trustee.shp")%>%
  st_transform(2278) 
#Add geometry to the MAE
MAE.district <- merge(TrusteeDist, val_MAPE_by_district, by.x="TRUSTEE", by.y="TRUSTEE", all.y=TRUE)

#Produce the map


#Map: MAPE of lm
MAE.district%>%
  filter(model=="lm") %>%
  ggplot() +
  #    geom_sf(data = nhoods, fill = "grey40") +
  geom_sf(aes(fill = q5(MAPE))) +
  scale_fill_brewer(palette = "Blues",
                    aesthetics = "fill",
                    labels=qBr(MAE.district,"MAPE"),
                    name="Quintile\nBreaks, (%)") +
  labs(title="MAPE of lm in School Districts") +
  mapTheme()

#Map: MAPE of glmnet
MAE.district%>%
  filter(model=="glmnet") %>%
  ggplot() +
  #    geom_sf(data = nhoods, fill = "grey40") +
  geom_sf(aes(fill = q5(MAPE))) +
  scale_fill_brewer(palette = "Blues",
                    aesthetics = "fill",
                    labels=qBr(MAE.district,"MAPE"),
                    name="Quintile\nBreaks, (%)") +
  labs(title="MAPE of glmnet in School Districts") +
  mapTheme()
#MAPE of rf
MAE.district%>%
  filter(model=="rf") %>%
  ggplot() +
  #    geom_sf(data = nhoods, fill = "grey40") +
  geom_sf(aes(fill = q5(MAPE))) +
  scale_fill_brewer(palette = "Blues",
                    aesthetics = "fill",
                    labels=qBr(MAE.district,"MAPE"),
                    name="Quintile\nBreaks, (%)") +
  labs(title="MAPE of rf in School Districts") +
  mapTheme()

#MAPE of xgb
MAE.district%>%
  filter(model=="xgb") %>%
  ggplot() +
  #    geom_sf(data = nhoods, fill = "grey40") +
  geom_sf(aes(fill = q5(MAPE))) +
  scale_fill_brewer(palette = "Blues",
                    aesthetics = "fill",
                    labels=qBr(MAE.district,"MAPE"),
                    name="Quintile\nBreaks, (%)") +
  labs(title="MAPE of xgb in School Districts") +
  mapTheme()